-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Add interleaving to sgemm and dgemm. Disentangle trmm/symm from gemm. #5573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
@martin-frbg @Mousius can you please have a look? |
…emm. Co-authored-by: Chris Sidebottom <[email protected]>
…ng of copyright notices added in last commit.
8316dc1 to
0a205ee
Compare
|
Thanks for looking into this - though I'm not immediately convinced that the speedup warrants the added complexity @Mousius ? At least I'd like to put this off until after the 0.3.31 release. |
The speedup is one thing, the other is enabling GEMM kernels without the need to also implement the other kernels. This would enable us to land existing SME kernels, as previously proposed here: #5011 (comment) Putting off until after 0.3.31 makes perfect sense to me as it's a relatively high risk change. |
Hi Martin. Thanks for taking a look. I'm happy for this to go in after the release. Do you have a rough estimate for when that might be, so I can share internally? |
|
@martin-frbg congrats on getting the release out. Would it be possible to get this merged? |
|
I'm probably missing something obvious, but I'm still a bit confused why this needs the addition of "COMM" kernels instead of declaring dedicated ?TRMM and ?SYMM COPY kernels within the existing framework, like the riscv64 folks do in e.g. KERNEL.RISCV64_ZVL128B ? |
The intent is to capture what is still common between trmm and symm, and reduce duplicate code. I can refactor if preferred? @Mousius let me know if you have opinions either way. No preference on my part. |
|
I agree with @martin-frbg, it'd be better to name them aligned with the algorithms they're influencing. @martin-frbg we're targeting these types of blocks: OpenBLAS/driver/level3/trmm_L.c Lines 151 to 161 in 1a9cf8e
I think that'd be adding |
|
Yes, at least that's my current understanding - that mapping ?SYMM_ONCOPY et al in the KERNEL file should do. I haven't tried it though, as my lowly M4 doesn't handle the SVE bits - guess I'd need to look into hijacking our Numpy-sponsored AWS CI job to run on M9g ? |
|
I don't think we currently have OpenBLAS/kernel/riscv64/KERNEL.RISCV64_ZVL128B Lines 169 to 172 in 1a9cf8e
Which was originally done for SVE: OpenBLAS/kernel/arm64/KERNEL.ARMV8SVE Lines 149 to 152 in 1a9cf8e
So we're aware of these ones being overridable, but the For a sanity check, I couldn't find any reference to |
This change adds interleaving to sgemm and dgemm copies and kernels for ARMV8SVE.
This required a degree of disentangling symm and trmm kernels from gemm. It should now be much easier to apply further optimisations to gemm.
The addition of interleaving provides a ~1.4% speedup on c7g (V1), with negligible changes on c8g (V2).
Taken over square matrix operations with size 2->2014, stepsize = 1:
Geometric mean for interleave/c7g_dgemm.txt: 0.9859023206257058
Geometric mean for interleave/c7g_sgemm.txt: 0.9887890902680289
Geometric mean for interleave/c8g_dgemm.txt: 0.9970050554316875
Geometric mean for interleave/c8g_sgemm.txt: 0.9948135816755502
We see an increase in the sgemm speedup (~2.4%) on c7g for larger matrix sizes.
Taken over square matrix operations with size 2,000->10,000, stepsize = 1,000:
Geometric mean for 64thread_interleave/c7g_dgemm.txt: 0.9865252964543917
Geometric mean for 64thread_interleave/c7g_sgemm.txt: 0.9762227312411808
Geometric mean for 64thread_interleave/c8g_dgemm.txt: 0.9997186302044462
Geometric mean for 64thread_interleave/c8g_sgemm.txt: 0.9996022927667269